Abstract

Background: SCOP is a hierarchical domain classification system for proteins of known structure. The superfamily 
level has a clear definition: Protein domains belong to the same superfamily if there is structural, functional and 
sequence evidence for a common evolutionary ancestor. Superfamilies are sub-classified into families; however,
there is no such clear basis for the family level groupings. Do SCOP families group together domains with
sequence similarity, with similar structure, or with common function? We address these questions and, most
importantly, ask whether each family represents a distinct phylogenetic group within a superfamily.

Results: Several phylogenetic trees were generated for each superfamily: one derived from a multiple sequence 
alignment, one based on structural distances, and the final two from presence/absence of GO terms or EC numbers 
assigned to domains. The topologies of the resulting trees and confidence values were compared to the SCOP 
family classification. 

Conclusions: We show that SCOP family groupings are evolutionarily consistent to a very high degree with respect 
to classical sequence phylogenetics. The trees built from (automatically generated) structural distances correlate 
well, but are not always consistent with SCOP (hand annotated) groupings. Trees derived from functional data are 
less consistent with the family level than those from structure or sequence, though the majority still agree. Much of 
GO and EC annotation applies directly to one family or subset of the family; relatively few terms apply at the 
superfamily level. Maximum sequence diversity within a family is on average 22% but close to zero for 
superfamilies. 



Background 

Proteins are made up of domains. Protein domains in 
this context can be regarded as the building blocks of 
proteins, and the smallest units of protein evolution. A 
small protein may consist of a single domain; larger proteins
may contain multiple domains. A domain can be
defined as a protein unit which is seen in nature either 
on its own or in combination with other different domains. 

Detecting the evolutionary relationship between two 
or more domains using sequence information alone is 
often not possible, as sequences often diverge beyond 
the point of detection by comparison methods. Lack of 
sequence information does not necessarily show that 
there is no relationship between domains. If the three di- 
mensional structure of the domains is known, evolutionary 
relationships can usually be recognised. The Structural
Classification of Proteins (SCOP) [1-3] is a hierarchical
classification system of proteins for which atomic reso- 
lution three dimensional structures are known; units in 
SCOP are protein domains. The SCOP classification takes 
protein structures published in the Protein Data Bank 
(PDB) [4] as the primary data source from which the do- 
main classification is derived. The classification of do- 
mains is based on both manual curation and automatic 
methods, the balance of which has resulted in a classifi- 
cation system which is regarded as the 'gold standard', 
and is an essential bioinformatics resource. 

Levels of classification in SCOP from the top down 
are: class, fold, superfamily, family. A class is just a con- 
venient grouping, e.g. domains containing only alpha- 
helices. Folds and superfamilies have a clear and precise 
definition of what they are supposed to represent: a fold 
groups together domains which have the same topo- 
logical arrangement of secondary structure; a superfamily 
groups together domains which share a common evolutionary
ancestor. The family level sub-groups domains
within a superfamily, but unlike the other levels lacks a 
precise definition. The first SCOP paper [1] states 30% 
sequence identity between members of a superfamily as 
significant support for a family grouping. However, in the 
first release of SCOP there were far fewer protein struc- 
tures available (a total of 13073 domains), and selecting 
an arbitrary sequence identity cutoff was possible. There 
are now nearly ten times the number of domains (110800 
as of SCOP 1.75). The family level of the classification 
further draws on structural and functional information in
the absence of strong sequence similarity, but the meaning
and the properties of the family object in SCOP
remain unclear.

Many projects have been based on the SCOP classifi- 
cation leading to several thousand citations [5-8]. Most 
of these projects make use of the clear evolutionary def- 
inition of a domain, and of a superfamily, so a better 
understanding of the family level will add value to future 
work which makes use of SCOP, and enable new research 
questions to be addressed. The research presented in this 
paper was carried out in order to elucidate the meaning 
and significance of the SCOP family level, in particular 
with regard to sequence, structure and function and their 
relationships to family classification. 

We also draw on protein functional information taken
from Gene Ontology (GO) terms [9]. GO is a standardised
vocabulary for describing gene products in terms of three
biological concepts: Biological Process, Molecular Function
and Cellular Component. Since many proteins are
enzymes, Enzyme Commission (EC) [10] numbers can
also aid in the understanding of protein function.

Results and discussion 

To understand the meaning of a family, we compared 
the groupings of domains in SCOP to determine the 
similarity to automatically generated groupings based in- 
dependently on the three aspects we wished to investi- 
gate: sequence, structure and function. Since we begin 
without a pre-conceived idea of the granularity or size/ 
depth of the groupings it is necessary to generate the 
automatic groupings at every possible level. This is repre- 
sented by a tree which is the result of hierarchical clus- 
tering of the domains based on one of the three sources 
of information: sequence similarity, structural similarity, 
functional labels (in the forms of Gene Ontology and En- 
zyme Classification). The level of agreement between one 
type of information and the grouping of a SCOP family 
can be assessed by asking whether each edge in the tree 
divides domains into family groups, or splits a family, 
grouping together domains from different families. 

The ROC curve in Figure 1 shows the number of
disagreements/agreements of the trees produced from
sequence, structure and functional data with the SCOP
family classification for varying confidence values. For se- 
quence, confidence is ranked by bootstrap percentages, 
for structural data the confidence is based on the struc- 
tural distance scores, and for function, confidence is 
based on the total number of terms which suggest a par- 
ticular clade in the trees. See materials and methods for 
details of a web resource providing all data and trees. 
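
The ranking behind a curve of this kind can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: each edge is given as a hypothetical tuple of (confidence, superfamily identifier, verdict), where the verdict labels "agree" and "disagree" are names chosen here, and the curve records how many superfamilies contain at least one agreeing or disagreeing edge as the confidence threshold is lowered.

```python
from typing import Iterable, List, Tuple

# Hypothetical edge record: (confidence, superfamily id, "agree" or "disagree").
Edge = Tuple[float, str, str]


def cumulative_curve(edges: Iterable[Edge]) -> List[Tuple[float, int, int]]:
    """Walk edges from most to least confident and return, for each step,
    (confidence, superfamilies with a disagreement, superfamilies with an agreement)."""
    agree_sf, disagree_sf = set(), set()
    curve = []
    for conf, superfamily, verdict in sorted(edges, key=lambda e: -e[0]):
        if verdict == "agree":
            agree_sf.add(superfamily)
        elif verdict == "disagree":
            disagree_sf.add(superfamily)
        curve.append((conf, len(disagree_sf), len(agree_sf)))
    return curve


if __name__ == "__main__":
    demo = [(100, "sf_a", "agree"), (90, "sf_b", "agree"),
            (80, "sf_a", "disagree"), (70, "sf_c", "agree")]
    for point in cumulative_curve(demo):
        print(point)
```

Plotting the second value against the third for each step gives a curve of the same general shape as those described for Figure 1; uninformative edges are simply ignored in this sketch.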

Sequence 

Within the literature there is variation in suggested lev- 
els for the minimum informative bootstrap confidence 
[11,12], with most suggesting about 70-80% required for 
confidence. We found that from 2046 families across 
428 superfamilies, 99.6% of the phylogenetic trees agree 
with the SCOP groupings for bootstrap values above 
80%. We also found that, although less reliable, there is 
useful information which can be acquired from the trees 
for bootstrap values down to 60%. These results show 
that, to the extent to which sequence information can re- 
liably determine evolutionary relationships, SCOP family 
groupings are evolutionarily consistent. Classical sequence 
phylogenetics are quite reliable for high bootstrap values, 
but are limited in the evolutionary distance over which 
they can resolve relationships. There are plenty of SCOP 
family groupings which sequence-based phylogenetics 
alone is unable to determine with high confidence - the 
low confidence parts of the tree. Although the classical 
phylogenetic analysis cannot inform us directly about the 
evolutionary consistency of many family groupings, the 
fact that there is such strong agreement with those that it 
can, gives a strong suggestion that the others (classified 
independently from this information) are also likely to be 
evolutionarily consistent. 

The top 13 edges which conflicted with the sequence 
trees were examined. These are shown in a table in 
Figure 2, along with an example of each type of disagree- 
ment. The most frequent disagreement was from families 
which were classified not long after the creation of SCOP. 
These families were classified at a time when PFAM [13] 
sequence data was not available, and therefore did not 
provide evidence in the curation of SCOP families. Sequence
information from PFAM is now one of the data
sources used to guide the classification. An example
is shown in Example 1. We also find examples
such as that shown in Example 2, where a family has 
been decided in SCOP based on function. Trees based on 
both sequence and structure place the single domain 
Pancreatic carboxypeptidases family between domains 
for a different family causing a disagreement of the trees 
with SCOP families. In this case the classification of a do- 
main into a new family of its own was likely based on a 
functional signal, however the tree based on function 
places the domain in a similar way to that of structure 
Figure 1 The number of superfamily agreements/disagreements with SCOP for varying confidence values. A ROC curve showing the
number of superfamilies containing agreements against the number containing disagreements of trees with SCOP's groupings, for confidence 
values decreasing from left to right. For sequence trees, confidence is based on the bootstrap value assigned to an edge. Structures are ranked 
using the total structural distance, and function is ranked by the total number of GO terms or EC numbers which support an edge. 



and sequence suggesting the domain should probably be- 
long to the surrounding family. Our method classes 
'nested families' as inconsistent with evolution (shown in 
Example 3), whereby one family grows from another in 
the tree. In some sense this is more a reflection of the 
limited number of levels in the hierarchy, suggesting that 
there are some families that actually represent a 'sub' 
family of another. We also find a small number of other
artefacts, where a family classification is based on the
source species. This can happen with proteins found in
viruses. We also see cases such as duplications of domains
grouped within the same family; an illustration of
this is shown in Example 4.

A potential factor which contributes to the disagree- 
ments seen in trees calculated from sequence data com- 
pared to those from the other data sources is also worth 
noting. Diverse superfamilies with very low sequence 
identity between member domains may provide an unre- 
liable multiple sequence alignment thereby creating a re- 
sult tree with limited accuracy. Anomalies introduced 
from this effect are more likely to be seen in very large 
superfamilies with a great deal of structural variation. 

Structure 

The trees built from automatically generated structural
distances largely agree, but are not always consistent with
SCOP's hand-annotated groupings. The hand classification



of structures in SCOP at the superfamily and fold levels 
is often referred to as the gold standard in the field, and 
clearly surpasses any fully automatic method. Since de- 
tectable structural similarity remains long after se- 
quences have diverged beyond the point of recognition, 
the structurally-derived trees are able to resolve deeper 
edges of the tree with higher confidence than the sequence- 
based ones (the intersection of the red and blue lines in 
Figure 1). That the trees are largely in agreement with 
the family classification indicates that SCOP is also evo- 
lutionarily consistent at greater divergence distances. The 
differences we see could either be cases where SCOP has 
grouped domains based on some criterion other than 
evolution (e.g. common function), or may be due to geo- 
metric structural distance being in some cases a poor 
measure of divergence. For some proteins, changes to 
the structure of a binding site may be the best indication 
of evolutionary divergence, but these changes make a 
relatively small contribution to the automatic superpos- 
ition of the whole body. Conversely, movements of sec- 
ondary structures relative to each other, e.g. a change of 
angle between beta-sheets [15], can cause dramatic changes 
in superposable structural distance which mask the true 
relationships. In this way structural geometric distance 
does not always equate to evolutionary distance. 

Examining high ranking disagreements between the 
SCOP family classification and structural trees can mostly 
Figure 2 Examples of disagreements with SCOP. Examples of SCOP superfamilies which contain a disagreement found with trees based on
sequence information, supported by high confidence values. Four of the common reasons for disagreement are explained. Images produced
with TreeVector [14].

Example 1
Superfamily (56300): Metallo-dependent phosphatases
Families shown: 143938, 118155, 111233, 103320
Sequence tree - confidence 100%
Conclusion: Families 143938 and 118155 share the same additional structural feature to the superfamily common fold and therefore should
have been merged in one rather than split to make three families.

Example 2
Superfamily (53187): Zn-dependent exopeptidases
Families shown: 53188, 53198
Sequence tree - confidence 89%
Conclusion: d1obra_ was initially classified by function. Sequence gap between families was bridged by recent structures suggesting
family 53198 should probably be part of 53188.

Example 3
Superfamily (49723): Lipase/lipooxygenase domain
Families shown: 49738, 49724, 49730
Sequence tree - confidence 77%
Conclusion: Nested Family Effect. In this example superfamily 49723 is observed in three different two-domain architectures, hence
the three different families.

Example 4
Superfamily (48113): Heme-dependent peroxidases
Panels: Sequence tree (families 48114, 74753, 48132); Structural tree (families 74753, 48114); Function (GO) tree (families 48114, 74753)
Family 74753 is a tandem repeat whereas 48114 is a single unit, the family and function apparently being determined by the number
of units. The sequence tree suggests that a duplication-fusion of the single unit domain created the tandem domain. The placement of the
red domains as an ingroup suggests a subsequent fission.


be explained by the above; however, one exception is
shown in Example 2 from Figure 2. This example shows
a sequence tree, but we see the same disagreement when
we look at the structural tree, and so in this case it suggests
the possibility of a mis-classification.

Function 

The lines for EC numbers and GO terms shown in 
Figure 1 are smaller and less smooth than the others. 
This is because confidence values are generated using 
the total number of independent features that support a 
particular edge of the tree. There are not very many GO 
features per tree and barely any for EC number. This is 
partly due to a lack of richness in the ontological hier- 
archy but also due to the incompleteness of the annota- 
tion of the domains with terms. Trees derived from both 
GO and EC functional data are less consistent with the 
family level than trees derived from structure or se- 
quence, though the majority still agree with the classifi- 
cation. This may be due to the low quality of the derived 
functional dataset, most commonly the lack of func- 
tional annotation for a particular domain. Functions are 
also appended to the protein chain rather than individ- 
ual domains, therefore terms may be uninformative for 
two domains found within the same protein. The fact 
that the correlation with function is so much weaker 
than sequence and structure suggests that although 
function may guide the choice of granularity or level of 
grouping of families in SCOP (see section on Distribu- 
tion of GO terms), it is not a primary source of informa- 
tion for determining relationships. 

Dustbin families 

In SCOP all domains must belong to a family, so a 
superfamily with a single member must also have a sin- 
gle family. As more structures are added to a superfam- 
ily over time, there may be new additions that have 
enough in common to group them apart from the rest 
and a second family is created to hold them. If this hap- 
pens successively the result is that some families contain 
domains with something in common, but any leftovers 
lacking common features with each other may remain in 
the original family that contained the first member of 
the superfamily. These non-specific families are referred 
to here as 'dustbin families'. The 'dustbin families' line in 
Figure 1 is derived from the same trees as for the stand- 
ard domain sequences line, but the rules by which edges 
are defined as conflicting are adjusted to not penalise 
for the presence of a single dustbin family in each 



superfamily. Remarkably, despite expectations, the re- 
sults show that they are not a major feature of the SCOP 
classification. 

Sequence identity 

Figure 3 shows the maximum sequence divergence be- 
tween any two members of a family or superfamily, i.e. a 
measure of the divergence within the family or super- 
family. The analysis of sequence distances shows that 
the maximum sequence diversity for domains grouped 
within a family is on average 22% with the majority of 
families having a maximum sequence distance of 10- 
30%. Superfamilies on the other hand have a sequence 
diversity spread of 8% and below, with the average being 
close to zero. While it is well known that remote hom- 
ology detection at the superfamily level is a difficult 
problem, the data show that about half (169) of the 341 
families (the most divergent family within each of the 
341 superfamilies in the analysis) contain members with 
no less than 20% sequence identity. 

Figure 4 shows the maximum structural distance found 
between two members of the same superfamily or family. 
The distribution shows that the maximum structural distances
are greater between two members of the same
superfamily than between two domains grouped in the same
family.

It is clear from the distribution in the graph in Figure 3 
that SCOP families are not selected by simply choosing 
a random sequence identity cutoff, and that the process 
of curation is much more elaborate. 

Distribution of GO terms 

Figure 5 shows the distribution of GO terms annotated 
to single domains across SCOP. We see that approxi- 
mately 1/3 of GO and EC annotation applies directly to 
one family, another 1/3 to a subset of a family, and the 
remaining 1/3 scattered across multiple superfamilies, 
with strikingly few terms that apply at the superfamily 
level. One would expect that the terms in the sub-family 
would be lower down the GO hierarchy and those span- 
ning multiple superfamilies would be broader terms 
found higher up the hierarchy, but the distribution across 
the GO hierarchy is quite similar in each of the three 
major segments of the pie chart shown in Figure 5. This 
distribution does not change significantly when looking 
at each of the three ontologies of GO (molecular function,
cellular component, biological process) separately.
A more detailed view is shown in Additional file 1: Table S1.
Figure 3 Sequence divergence in families and superfamilies. Graph shows the maximum sequence diversity between two members of the
same superfamily (or family) in SCOP. Domains which continue to diverge beyond detectable sequence identity have their distribution collapsed 
to the far left side of the graph; the large number with zero percent sequence identity represent those cases in which BLAST was unable to find 
alignment. 



Despite the weak link between SCOP family classification
and the edges of trees representing functional data,
we see a very large proportion of functional terms corresponding
to exactly one family, and almost none close to
the superfamily level. This suggests that the relationships
between members of a superfamily and their distance
apart are evolutionary, having been based on evidence
from structure and sequence (not function), but the
granularity at which to divide the members of a superfamily
is decided by function. That is, domains are not grouped
based on their function, but the number of groups relates
to the number of functions.



Conclusions

Sequence information contributes to the classification of
domains into families, but alone is not enough. To classify
a family evolutionarily: it must be consistent with sequence
phylogenetics, will likely draw on structural
distance, and will often coincide with a particular function.
Sequence diversity between families (within a
superfamily) is considerably greater than within a family.
Sequence phylogenetics do not give a strong enough signal
at the superfamily level to classify families, but where
there is a signal it is consistent with the SCOP classification.
Structural information is necessary for identifying
evolutionary relationships of families in a superfamily
where sequence identity is low. We see that although
function does not determine the relationships, i.e. edges,
it is used to guide the level at which the tree is cut to
make a family, i.e. the choice of node from which to derive
a clade (granularity).

The families in SCOP represent a level at which sequence,
structure, function plus other information on a
shared peculiarity must all be taken into account. A balance
of the strengths of signals available is used to establish
the evolutionary relationships and resolve the
groupings.



Methods 

The data for all trees used to generate Figure 2 are available
as a web resource at http://supfam2.cs.bris.ac.uk/pethica/scopresults.
The data may be ranked on each of
the confidence scores separately or together. For every 
superfamily there are tree images for sequence, structure 
and function annotated with the PDB domain and SCOP 
family ID as shown in Figure 2. The tree data can add- 
itionally be downloaded in Newick format [16]. Also 
available are all the matrices of Structal [17] data used to 
generate the structural distance trees. 
Figure 4 Structural divergence in families and superfamilies. Graph shows the maximum structural diversity between two members of the
same superfamily (or family) in SCOP. Structural distances used are the scores produced by Structal for the alignment of two domains.



Sequence based phylogeny 

Domain sequences for SCOP version 1.73, filtered to 
95% sequence identity were obtained from ASTRAL [18]. 
The complete set of sequences was filtered to remove 
superfamilies for which SCOP's family level classification 
could not be contested. These cases included superfamilies
containing a single family, those where each family
contained only one member, and any superfamily made
up of three or fewer domains. A detailed breakdown of the
number of domains, families, and superfamilies used in 
the analysis can be found in Additional file 2: Table S2. 

For each superfamily in the classification the sequences 
of assigned domains were used to produce an align- 
ment using MUSCLE [19]. Alignments were converted 
to Stockholm format using sreformat which is part of 
the HMMER package [20]. Quicktree [21], a fast implementation
of the neighbour joining algorithm, was used
to produce runs of both 300 and 600 bootstrap replicate
trees from the sequence alignments. Phylip Consense
was used to create a single consensus tree from the sets 
of replicate trees. In this process, the number of occur- 
rences of a particular edge from the replicate trees was 
converted to a single confidence score giving the final 
tree confidence values for each edge. 
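
The counting step described above can be illustrated with a short sketch. This is not Phylip Consense; it is a minimal stand-in that assumes each replicate tree is already available as a nested tuple of leaf names (a hypothetical in-memory representation chosen for this example) and reports, for every non-trivial split, the percentage of replicates in which it occurs.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set, Union

# Hypothetical tree representation: a leaf name, or a tuple of subtrees.
Tree = Union[str, tuple]


def leaves(tree: Tree) -> FrozenSet[str]:
    """All leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree: Tree) -> Set[FrozenSet[str]]:
    """Non-trivial bipartitions of a tree. Each split is stored as the side
    that does not contain the alphabetically first leaf, so the position of
    the root does not matter."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> None:
        if isinstance(node, str):
            return
        side = leaves(node)
        if anchor in side:
            side = all_leaves - side
        # Skip splits that separate nothing or only a single leaf.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def bootstrap_support(replicates: List[Tree]) -> Dict[FrozenSet[str], float]:
    """Percentage of replicate trees in which each split occurs."""
    counts: Counter = Counter()
    for tree in replicates:
        counts.update(splits(tree))
    return {split: 100.0 * n / len(replicates) for split, n in counts.items()}


if __name__ == "__main__":
    reps = [((("A", "B"), "C"), ("D", "E")),
            ((("A", "B"), "D"), ("C", "E")),
            ((("A", "C"), "B"), ("D", "E"))]
    for split, pct in bootstrap_support(reps).items():
        print(sorted(split), round(pct, 1))
```

A consensus program additionally decides which splits to keep in the final tree; the sketch covers only the conversion of occurrence counts into per-edge confidence values.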

A second set was also produced where domain se- 
quences were padded with homologue sequences from 
the SUPERFAMILY database. These were aligned, and 



trees created as with the original set. A script was used 
to remove the homologues from the trees leaving only 
the original domain sequences, but preserving all phylogenetic
relationships. The dataset calculated without
homologues, with 300 replicates, was chosen as very little
difference was seen between the two replicate sets, and
the addition of homologue sequences created larger
alignments which were handled badly by the phylogenetic
algorithms.

Structural phylogeny 

PDB-style three dimensional structures for the
same filtered set of SCOP 1.73 domains as used for sequence
were taken from ASTRAL. Structal [17] was used to compare
the 3D structures of every domain against every
other in a superfamily, for all superfamilies in the set, in
a computationally expensive process of around 1.5 million
structural comparisons. The Structal software was
chosen from the large number of other structural comparison
methods due to its balance of speed and accuracy
for a computation of this kind. The Structal SAS
scores (100*RMS/Number of positions matched) for each
pair of domains were used to create a matrix of structural
distances for each superfamily. The neighbour joining algorithm
in the PAUP [22] package was used to compute
phylogenetic trees from the distance matrices.
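
As an illustration of the SAS formula quoted above, the sketch below turns pairwise alignment results into a symmetric distance matrix. It assumes the RMS deviation and the number of matched positions for each pair have already been obtained from a structural aligner; the input dictionary, the function names and the domain labels are hypothetical and do not reflect Structal's actual output format.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Tuple


def sas_score(rms: float, n_matched: int) -> float:
    """SAS = 100 * RMS / number of positions matched."""
    if n_matched <= 0:
        raise ValueError("SAS is undefined when no positions are matched")
    return 100.0 * rms / n_matched


def distance_matrix(
    domains: Sequence[str],
    alignments: Dict[FrozenSet[str], Tuple[float, int]],
) -> List[List[float]]:
    """Symmetric matrix of SAS scores with zeros on the diagonal.
    `alignments` maps an unordered pair of domains to (RMS, matched positions)."""
    index = {name: i for i, name in enumerate(domains)}
    n = len(domains)
    matrix = [[0.0] * n for _ in range(n)]
    for a, b in combinations(domains, 2):
        rms, n_matched = alignments[frozenset((a, b))]
        score = sas_score(rms, n_matched)
        matrix[index[a]][index[b]] = score
        matrix[index[b]][index[a]] = score
    return matrix


if __name__ == "__main__":
    doms = ["dom_x", "dom_y", "dom_z"]  # placeholder labels
    aligned = {
        frozenset(("dom_x", "dom_y")): (2.0, 100),
        frozenset(("dom_x", "dom_z")): (1.5, 50),
        frozenset(("dom_y", "dom_z")): (3.0, 60),
    }
    for row in distance_matrix(doms, aligned):
        print(row)
```

Dividing by the number of matched positions penalises alignments that achieve a low RMS only by superposing a small fragment, which is why SAS is a more suitable distance than raw RMS for this purpose.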
Level in SCOP        Cellular Component   Molecular Function   Biological Process   Total
Sub Family                           30                  232                  137     399
Family                               48                  236                  197     481
Superfamily                           1                   22                   11      34
Above Superfamily                    55                  234                  211     500
All Levels                          134                  724                  556    1414



Figure 5 Level in SCOP of all single domain proteins associated with a specific GO term. Figure shows the level in SCOP at which all single
domains associated with a particular GO term are found, i.e. whether the group represents a family or superfamily. These are also broken down into the
three main ontologies of GO terms.



Functional phylogeny 

Gene ontology (GO) data from EBI GOA [23] was used 
to annotate domains with functional terms using the 
same set of domains that was used for the sequence and 
structure trees. For each superfamily, a binary presence/ 
absence matrix was generated of all GO terms versus all 
domains in the superfamily. The terms were treated 



independently of the hierarchy, but uninformative terms 
(present in all or present in only one domain) were ig- 
nored. For each superfamily the presence/absence matrix 
was used to generate a phylogenetic tree using PAUP 
neighbour joining. An additional set of functional trees 
was also generated using the same technique, but with 
functional data from Enzyme Commission (EC) numbers. 
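
The construction of the presence/absence matrix can be sketched as follows. The input format (a dictionary from domain name to its set of terms) and all identifiers are assumptions made for this illustration; the filtering rule follows the description above, dropping terms present in every domain or in only one.

```python
from typing import Dict, List, Set, Tuple


def presence_absence(
    annotations: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (domains, informative terms, binary matrix) for one superfamily.
    Rows are domains, columns are terms; terms carried by all domains or by
    a single domain are treated as uninformative and left out."""
    domains = sorted(annotations)
    all_terms = sorted(set().union(*annotations.values()))
    informative = []
    for term in all_terms:
        carriers = sum(term in annotations[d] for d in domains)
        if 1 < carriers < len(domains):
            informative.append(term)
    matrix = [[int(term in annotations[d]) for term in informative]
              for d in domains]
    return domains, informative, matrix


if __name__ == "__main__":
    demo = {  # placeholder domain and term names
        "dom_1": {"term_a", "term_b", "term_c"},
        "dom_2": {"term_a", "term_b"},
        "dom_3": {"term_a", "term_d"},
    }
    print(presence_absence(demo))
```

The resulting rows can be treated as binary character vectors for a distance-based tree-building method; how distances are derived from them is left to the phylogenetics package.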
Figure 6 An overview of the algorithm used to determine agreements/disagreements of trees with SCOP's groupings. Figure shows part
of a tree built from domain sequences in a SCOP superfamily, and illustrates the algorithm involved in establishing if the tree agrees or disagrees 
with SCOP's family level grouping. 
Testing against SCOP families

Phylogenetic trees of domains in each superfamily produced
by each method could then be compared with the
groupings at the SCOP family level. An algorithm was
produced to traverse the trees and identify if a particular
edge agreed, disagreed or was uninformative with regard
to SCOP families:

• An edge of the tree is said to agree with SCOP if 
one side contains the full set of domains for a 
certain family and no members of another family. 

• An edge disagrees with SCOP when domains from a 
certain family are found on both sides along with 
domains from a different family. 

• A neutral or uninformative edge is where one side
contains only members from a certain family, but
not the complete set, i.e. more are found on the
other side of the edge.

An overview of the algorithm used is shown in Figure 6. 
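
The three rules listed above can be written as a small function. This is a sketch rather than the original implementation: an edge is represented by the two sets of domains on either side of it, `family_of` is a hypothetical lookup from domain to SCOP family, and any edge not covered by the three rules (for example, one that separates two groups of complete families) is returned as "neutral", which is an assumption of this example.

```python
from typing import Dict, Set


def classify_edge(side_a: Set[str], side_b: Set[str],
                  family_of: Dict[str, str]) -> str:
    """Classify one tree edge as 'agree', 'disagree' or 'neutral'."""
    fams_a = {family_of[d] for d in side_a}
    fams_b = {family_of[d] for d in side_b}
    split_families = fams_a & fams_b  # families with members on both sides

    for fams in (fams_a, fams_b):
        if len(fams) == 1:
            # This side holds a single family: it agrees if the family is
            # complete here, otherwise the edge is uninformative.
            return "neutral" if fams & split_families else "agree"

    if split_families:
        # A family is split while both sides mix several families.
        return "disagree"
    return "neutral"  # assumption: case not covered by the stated rules


if __name__ == "__main__":
    fam = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
    print(classify_edge({"a1", "a2"}, {"b1", "b2", "c1"}, fam))
    print(classify_edge({"a1"}, {"a2", "b1", "b2", "c1"}, fam))
    print(classify_edge({"a1", "b1"}, {"a2", "b2", "c1"}, fam))
```

Applying this function to every internal edge of a tree, together with the confidence value attached to that edge, yields the per-edge verdicts that are then ranked.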

Sequence divergence of domains in superfamilies 
and families 

Sequences for domains in SCOP 1.73 superfamilies were 
acquired from ASTRAL. Superfamilies containing a single 
domain only were removed. For each superfamily group- 
ing, sequence identities were sequentially calculated 
with Washington University BLAST [24], the highest se- 
quence identity members being removed until only the 
two most distant sequences remained. This process was 
repeated for domains grouped in families to give se- 
quence distance scores for all relevant families and su- 
perfamilies in SCOP. 
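
The end result of this procedure, the two most distant members of a group and their sequence identity, can be illustrated with a simplified sketch. It does not reproduce the sequential removal described above; it assumes all pairwise identities are already known (the dictionary format and names are hypothetical) and takes the minimum directly. Pairs with no entry are treated as 0% identity, mirroring the cases in which no alignment was found.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Tuple


def most_divergent_pair(
    members: Iterable[str],
    identity: Dict[FrozenSet[str], float],
) -> Tuple[Tuple[str, str], float]:
    """Return the pair of members with the lowest percent identity.
    Pairs missing from `identity` count as 0.0 (no alignment found)."""
    names = sorted(members)
    if len(names) < 2:
        raise ValueError("need at least two members")

    def pair_identity(pair: Tuple[str, str]) -> float:
        return identity.get(frozenset(pair), 0.0)

    best = min(combinations(names, 2), key=pair_identity)
    return best, pair_identity(best)


if __name__ == "__main__":
    ident = {  # placeholder names and values
        frozenset(("dom_1", "dom_2")): 45.0,
        frozenset(("dom_1", "dom_3")): 22.0,
        frozenset(("dom_2", "dom_3")): 30.0,
    }
    print(most_divergent_pair(["dom_1", "dom_2", "dom_3"], ident))
```

Running the same function once per family and once per superfamily gives the two distributions that are compared in Figure 3.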

Functional divergence across SCOP 

For each GO term in the EBI GOA dataset a list of sin- 
gle domain proteins with the particular annotation was 
generated. The sequence identity of the two most distant 
sequences in the set was determined. The distribution of 
domains across the SCOP classification and level in the 
hierarchy for a specific functional annotation was also 
calculated, e.g. all domains contained within a specific
family or superfamily.

Additional files 



all members of a family, but not in other families.
• Multi Family: Found in all members of more than one family, but not in all families in the superfamily.
• Partial Family: Completes one or more families, is absent from one or more families and is incomplete from exactly one family of a specific superfamily.
• Scattered Families: May or may not complete one family, more than one incomplete family, and at least one empty family per superfamily.
• Scattered in Superfamily: Present in, but does not complete, all families in a superfamily.
• Almost Superfamily: Present in all families, completes some of them.
• Superfamily Equivalent: Present for every domain of just one superfamily.
• Multi Superfamilies: Present in every domain in more than one superfamily.
• Partial Superfamilies: Completes at least one superfamily; partially completes exactly one other superfamily.
• Scattered Superfamilies: May or may not complete one superfamily; present but not completing at least one other superfamily.

Additional file 2: Table S2. Statistics for the number of domains used 
in the phylogenetic analysis. Statistics for domains in raw SCOP 1.73 and 
after filtering to 95% sequence identity and removal of trivially solvable 
cases. 